Accelerating linear algebra kernels for any processor architecture

ABSTRACT

Systems and methods for obtaining a set of instructions for executing a computer program and generating executable code for the computer program based, at least in part, on scheduling operations associated with the executable code according to a polyhedral representation of a directed acyclic graph. The set of instructions may be represented as a domain-specific language. The executable code may be executable code for a specific processor architecture.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No.16/277,661 filed Feb. 15, 2019, entitled ACCELERATING LINEAR ALGEBRAKERNELS FOR ANY PROCESSOR ARCHITECTURE,” which claims benefit of U.S.Provisional Application No. 62/641,121 entitled “A SYSTEM AND METHOD FOREXPRESSING AND COMPILING LINEAR ALGEBRA KERNELS” filed on Mar. 9, 2018.

BACKGROUND

Graphics Processing Units (GPUs) play a major role in scientificcomputations, and in neural network training and inference. Regularcomputational structure of scientific computing codes makesgeneral-purpose GPUs (GPGPUs) a good fit to perform these computationswith high efficiency and low energy. Matrix-matrix multiplication andmany other linear algebra computations form the basis for severalscientific computation and neural network algorithms. Hence, havingefficient implementations of basic linear algebra computations fordifferent GPU architectures is critical for the overall performance.

However, there are many challenges surrounding evolving architectures,and high performing linear algebra kernels have to be written and tunedfor each current and upcoming GPUs. It is becoming increasinglydifficult for programmers and researchers is to manually write thesekernels for each individual architecture in GPU's assembly-levellanguage (SASS), and hand tune them for improved performance. Thisinvolves significant and continuous manual effort.

In addition, scientific computation codes are generally composed ofvarious sub-computations in different combinations. For example,training a neural network involves performing a matrix-multiplication,followed by adding a bias, and applying an activation function. Thereare an impractical number of possible combinations of sub-computationalsequences to be optimized, making it impractical to manually write andhand-tune all possible combinations. Further, there are severalscenarios (including machine-learning) in which it is important tospecialize implementation of these kernels for specific problem sizes.Accordingly, there are many challenges surrounding how to efficientlygenerate kernels for these computations.

BRIEF DESCRIPTION OF THE DRAWINGS

Various techniques will be described with reference to the drawings, inwhich:

FIG. 1 illustrates a computing environment in which a domain-specificlanguage (DSL) compiler is implemented, in accordance with oneembodiment;

FIG. 2 illustrates a computing environment in which a program isimplemented according to techniques described here to generate optimizedexecutable code, in accordance with one embodiment;

FIG. 3 illustrates a computing environment showing an architecture ofthe DSL compilation process, in accordance with one embodiment;

FIG. 4 shows an illustrative example of a process for generatingoptimized executable code, in accordance with an embodiment;

FIG. 5 illustrates a parallel processing unit (PPU), in accordance withone embodiment;

FIG. 6 illustrates a general processing cluster (GPC), in accordancewith one embodiment;

FIG. 7 illustrates a memory partition unit, in accordance with oneembodiment;

FIG. 8 illustrates a streaming multiprocessor, in accordance with oneembodiment; and

FIG. 9 illustrates a computer system that can be implemented inconnection with a PPU, in accordance with one embodiment.

DETAILED DESCRIPTION

In an embodiment, a domain-specific language (DSL) refers to a computerlanguage or programming language having a limited purpose and may have alimited expressiveness with regard to the types of operations supportedby the DSL. In an embodiment, a DSL is a non-Turing-complete language. Adomain-specific language may be described in contrast to ageneral-purpose language (GPL) which is broadly applicable acrossapplication domains and lacks specialized features for a particulardomain. Examples of GPLs include general-purpose markup languages suchas XML, general-purpose modeling languages such as the Unified ModelingLanguage (UML), and general-purpose programming languages such as C++,Java, PHP, and Python. In an embodiment, a domain-specific language hasa syntax or grammar that includes various types of primitives such asdeclarations, statements, and specific calls such as a function call forcode generation. In an embodiment, the term “CodeGen” is used todescribe an operation that causes generation of optimized code, such asin accordance with techniques described in connection with FIG. 4 butother suitable terms to reference such an operation are alsocontemplated as being within the scope of this disclosure as specificexamples (e.g., “CodeGen” function) are merely illustrative in nature.In an embodiment, lists of input and output matrices are declared at thebeginning of a DSL input file. In an embodiment, the input matrixdimension sizes are specified in the matrix declaration. In anembodiment, the dimensions of output matrices are automatically derivedor calculated from the matrix dimension sizes of the input matrices andare not required to be specified by a user. In an embodiment,declarations are followed by sequences o expressions composed of matrixoperations. In an embodiment, expression are described in a staticsingle assignment (SSA) format wherein each matrix can appear at mostonce in the left-hand side of assignment operators. In an embodiment,various operations are supported by the DSL. In an embodiment, supportedarithmetic operations include one or more of the following: addition(+), which may refer to pointwise addition of matrix elements;subtraction (−), which may refer to pointwise subtraction of matrixelements; multiplication(*), which may refer to multiplication of twomatrices. In an embodiment, the DSL supports assignment (=) andtransposition operations ({circumflex over ( )}T). In an embodiment,various parameters required for code generation are specified; theseelements may include, for example: element ty of the matrices, list ofoutput matrices; operational prefix for the filename of the generatedfiles; and any suitable combination thereof. In an embodiment, if nofilename prefix is specified, the files are generated with a defaultprefix.

In an embodiment, the DSL input file can be configured to specifyvarious parameters such as tile sizes; various configurations can beconfigured with default values. In an embodiment, one or more of thefollowing parameters are configurable and/or have default values if notconfigured: warpsize (default=32); shared memory bank size (default=32);warp level tilesizes, thread level tilesizes; vector size for globalmemory loads and stores (default=2); vector size for shared memory loadsand stores (default=4); swizzling support (default=supported); swizzlingbit pattern; unrolling support (default=supported); block-cyclicdistribution (default=supported); and any suitable combination thereof.

In an embodiment, an input or expression file is encoded based on thefollowing:

Matrix A (512 ,512); Matrix B (512 ,512); Matrix C; Matrix D; C = A*B; D= C+A; CodeGen ( float32 , {C, D}, ex1 );

In an embodiment, a configuration file is encoded based on thefollowing:

WarpSize <32 > SharedBankSize <32 > BlockTile <8 ,8 ,8 > WarpTile <4 ,8>ThreadTile <8 ,8 ,1 > GlobalVectorSize <2> SharedVectorSize <4>SwizzleBits <0 ,1 > Unroll <1>

The DSL is invoked, in an embodiment, with an expressions file andconfiguration file using any suitable syntax. In an embodiment, thefollowing command invokes a DSL:

>1a_dsl<expressions_file><config_file>

As a result, in an embodiment, three output files are generated, a hostfile, a device file, and a header file, which may be in the followingformat:

<prefix>_host.cu <prefix>_device.cu <prefix>.hwherein the <prefix> placeholder refers to the filename prefixeddescribed above, in accordance with one embodiment. In an embodiment, anexpression file and configuration file are implemented in connectionwith FIG. 1 .

In an embodiment, computer systems described throughout this disclosureutilize polyhedral compilation techniques which are applicable to atleast a sub-class of code or executable routines called static controlparts (SCoPs). In an embodiment, a part of a program is identifiable asa SCoP if it meets certain requirements: first, any iterative loops suchas FOR-loops include constant strides, and conditional statements (e.g.,IF-ELSE conditions, ternary operators) whose conditional express andloop bound expressions are affine functions of program parameters andsurrounding loop induction variables. The stride of a loop, in anembodiment, refers to the distance in memory of data access bysuccessive iterations of a loop. In an embodiment, array accessfunctions of the statements are expressible as affine functions ofprogram parameters and surrounding loop induction variables. In anembodiment, linear algebra computations are SCoPs. In an embodiment, apolyhedral model captures the execution of SCoPs in a compact form assets and relations. In an embodiment, ISL terminology may be used todescribe the sets and relations.

An iteration domain may refer to a compact representation of statementinstances of a SCoP statement. In an embodiment, the set of alliteration vectors of a statement S is the iteration domain of S. As anillustrative example, consider the following matrix-matrixmultiplication:

for (i=0; i<N; i++)  for(j=0; j<N; j++)   for (k=0; k<N; k++)    S1:C[i][j] += A[i][k] * B[k][j];The iteration domain of the statement S1 as described above, which isdenoted as D_(S1), is given by [N]→{S1[i,j,k]: 0≤i,j,k<N} wherein theleft-hand side of the arrow operator→refers to a list of parameters thatdefines the set; the space in which the set lives in is denoted by S1[ .. . ]; the points belonging to the set are specified using thePresburger formulae shown to the right of the colon: thereby modelingunion of integer polyhedrons.

In an embodiment, access functions map statement instances to memorylocations from where data is read from or written to. In an embodiment,for a polyhedral model, access functions are represented by affine mapsfrom iteration domain to data sets. In an embodiment, such as those inconnection to the matrix-matrix multiplication described above, readfrom array A is given by the map [N]→{S1[i,j,k]->A[i,k]: 0≤i,j,k<N}.

In an embodiment, the order in which statement instances are to beexecuted is defined by a schedule which may be encoded as a schedulefile. In an embodiment, a schedule is an affine schedule S_(S) for astatement S is an affine map that assigns a multi-dimensional integervalue to each instance of S. In an embodiment, an affine schedule is inaccordance with those described in connection with “Scheduling UniformLoop Nests” to Darte and Robert, which is hereby incorporated byreference. In an embodiment, a statement instance s₁ of S is executedbefore an instance s₂ if the value v₁ assigned by S₁, to s₁ islexicographically smaller than the value v₂ assigned to s₂. In anembodiment, the affine schedule corresponding to the execution orderprovided by the code segment of the matrix-matrix multiplicationdescribed above for statement S₁ is given byS_(S1):=[N]→{S1[i,j,k]→[i,j,k]: 0≤i,j,k<N}.

In an embodiment, a schedule specifies a relative or absolute timestampfor each point (e.g., iteration vector) in the iteration domain. In anembodiment, affine schedules are interpreted as a set of hyperplanes—forthe matrix-matrix multiplication described above, the set of hyperplanesfor the schedule S_(S1) is given by {(1, 0, 0), (0, 1, 0), (0, 0, 1)},which corresponds to the range [i,j,k] of S_(S1). In an embodiment, acode generate generates appropriate loop nests, corresponding to anaffine schedule, using the aforementioned hyperplane directions and thebounds on the iteration domain.

In an embodiment, dependencies between iteration points of a SCoP arederived from its corresponding schedule and access functions. In anembodiment, and in accordance with the matrix-matrix multiplicationdescribed above, both the read and write accesses of statement S1 toarray C are given by: R_(S1,C):=W_(S1,C):=[N]→{S1[i,j,k]→C[i,j]:0≤i,j,k<N}. In an embodiment, read-after-write (RAW) dependence P_(S1)^(RW):=[N]→{S1 [i,j,k]→S1[i,j,k+1]:0≤i,j<N∧0≤k<N−1} is derived from theschedule S_(S1), and the relations R_(S1,C) and W_(S1,C). In anembodiment, determining a polyhedral optimization involves determiningan optimal schedule for execution under a given objective such asmaximizing parallelism, minimizing data movement, etc. In an embodiment,a polyhedral optimization involves determining a sub-optimal such as aschedule that exceeds a threshold level of parallelism or fellas below athreshold level of data movement and so forth. In an embodiment, datadependencies play an important role by constraining the possibleschedules that are derivable. Valid schedules S for a statement respectRAW, WAW, and WAR dependencies. In an embodiment, if an iteration vectorv₁ has a RAW dependence on vector v₂, then S(v₁)

S(v₂) where

denotes lexicographical order. In an embodiment, a read-after-read (RAR)dependency indicates reuse of data which is utilized in findingschedules that improve data locality.

In the preceding and following description, various techniques aredescribed. For purposes of explanation, specific configurations anddetails are set forth in order to provide a thorough understanding ofpossible ways of implementing the techniques. However, it will also beapparent that the techniques described below may be practiced indifferent configurations without the specific details. Furthermore,well-known features may be omitted or simplified to avoid obscuring thetechniques being described.

As one skilled in the art will appreciate in light of this disclosure,certain embodiments may be capable of achieving certain advantages,including improving operation of processing units of a computer bygenerating executable code that causes the processors to moreefficiently utilize computing resources such as registers andhigher-level caches.

FIG. 1 illustrates a computing environment 100 in which a DSL compileris implemented. In an embodiment, the DSL compiler comprises threecomponents: a frontend 102 that parses the input program, and convertsthe program into its polyhedral representation; a scheduler 104 thatcomputes an efficient computation schedule corresponding to the inputprogram for the target GPU; and a code generator 106 that generatesdevice functions from the schedule computed by the scheduler.

In an embodiment, the frontend 102 accepts an input program file 108from a user and converts it into an equivalent polyhedral intermediaterepresentation (IR), which is then passed to the scheduler 104 foroptimization. An equivalent polyhedral IR, in an embodiment, refers toany suitable polyhedral representation generated from a directed acyclicgraph of the input program wherein the DAG includes nodes thatcorrespond to operators and operands of the input program. In anembodiment, the frontend 102 also accepts a configuration file 110 whichspecifies a set of parameters for determining how to process the inputfile to determine a schedule. In an embodiment, the DSL supports anintuitive syntax to represent linear algebra expressions comprising aseries of operations that include general matrix-matrix multiplication(GEMM), matrix addition, and more. In an embodiment, the frontend 102,as part of converting the input program to a polyhedral representation,the DSL constructs an expression directed acyclic graph (DAG) 112 wherenodes represent linear algebra primitives such as variables andoperators. In an embodiment, dependencies are captured by the directededges between nodes. In an embodiment, each node is associated with anaffine set, called dataset, and an affine map, called datamap. In anembodiment, the frontend 102 is implemented in any suitable mannerincluding as a software module of a computer system. In an embodiment,the frontend 102 accepts the input program file and the configurationfile via an application programming interface (API) command.

In an embodiment, each node of the DAG has an associated affine setcalled dataset, that contains the information about the space needed tohold its data—a node corresponding to an input matrix Matrix A(M,K) hasan associated dataset [M, K]→{A[i,j]:0<=i<M and 0≤j<K}, which is atwo-dimensional set with each dimension constrained by the matrix size.In an embodiment, dimensionality and sizes of the datasets for inputnodes are obtained from declarations specified by the user. In anembodiment, the datasizes of remaining nodes in a DAG are derived usingpredefined rules.

In an embodiment, datamaps are affine functions that capture data layoutinformation of datasets. For instance and in an embodiment, Matrix A mayhave a dataset {A[i,j]:0<=i<128 and 0<=j<64}. In an embodiment, thedatamap can be used to determine whether different expressions can sharethe same address space—for example matrix A and the transpose of thematrix A{circumflex over ( )}T semantically indicates that the elementsof matrix A are just reordered to their transposed locations, withoutany change in their values so both expressions A and A^(T) can share thesame physical memory space using the datamap {A^(T)[i,j]→A[j,i]:},representing transpose DLT. In an embodiment, datamap of all types ofnodes, except transpose ({circumflex over ( )}) node, is an identity mapfrom its dataset to itself. Datamap for transpose node

N ₁ [N ₂ [E ₁]^(T) :=M(N ²)°{M_1[i,j]->M_2[j,i]:}.

In an embodiment, the frontend 102 is configured with executable codefor extracting the polyhedral IR 114 of the input program by buildingiteration domains, access functions, and data dependence relationscorresponding to the expressions. In an embodiment, the iteration domainand data dependence information is required to find a valid schedule forthe computation, while access functions are needed for data layouttransformations, and memory promotions.

In an embodiment, iteration domains of different nodes live in differentset-spaces, and are identified by their space names. In an embodiment,the iteration domain of various operations is defined usingdomain-specific knowledge or rules. A domain-specific language may bedescribed in contrast to a general-purpose language (GPL) which isbroadly applicable across application domains and lacks specializedfeatures for a particular domain. Examples of GPLs includegeneral-purpose markup languages such as XML, general-purpose modelinglanguages such as the Unified Modeling Language (UML), andgeneral-purpose programming languages such as C++, Java, PHP, andPython. In an embodiment, each operation is associated with two accessmaps—read, and write access maps wherein read maps capture readinformation of an operation by relating its iteration domain points todatasets of its operands, while write maps are used to capture writeinformation by relating iteration domain points to its own dataset. Inan embodiment, aliases between datasets for the operands are resolvedwith the help of datamaps.

The scheduler 104 illustrated in FIG. 1 is implemented using hardware,software, or a combination thereof. In an embodiment, the scheduler usesISL's implementation of PLuTo scheduling algorithm to obtain an initialaffine schedule 116. In an embodiment, multiple loop nests are fused ifand only if fusing them doesn't lead to loss of parallelism.

In an embodiment, the scheduler applies a sequence of optimizations tothe initial affine schedule that was obtained, which may include one ormore of the following: tiling; memory promotion; software pipelining;and more. Tiling serves the dual purpose of improving data locality, andenabling distribution of workload to thread and thread blocks, in anembodiment. In an embodiment, efficient memory promotion is achieved bydistributing parallel loops in a block-cyclic fashion. In an embodiment,memory promotion refers to utilization of memory in which differentthreads within a threadblock share data through shared memory whoselatency is lower than global memory. In an embodiment, a GPU's sharedmemory is software managed.

In an embodiment, the final affine schedule 118, after application ofseveral transformations is represented in the form of a schedule treeand a device function is created by the code generator corresponding toeach high-level loop nest where grid sizes and threadblock sizes arederived by analyzing the bounds of parallel schedule dimensions that aremarked to be distributed. In an embodiment, during code generation bythe code generator 106, the loops that are marked for distribution arereplaced with appropriate CUDA variables (such as blockIdx.y, etc), andthe remaining loops and statements are printed within body of the devicefunction. Diesel also generates a driver function that could be calledby the user to run the sequence of device functions. In an embodiment,the resulting output is executable code 120 that is executable accordingto one or more processor architectures. In an embodiment, the executablecode 120 includes instructions that, as a result of execution by acomputer system comprising a GPU 122, causes the computer system to runthe input program with the application of one or more optimizations thatimproves the efficiency of the operation. In an embodiment, a computersystem 124 implements any suitable combination of the frontend 102,scheduler 104, and code generator 106, which may be software modules ofa software application. In an embodiment, the computer system 124comprises the GPU 122. In an embodiment, the GPU 122 is a component ofthe computer system 124. In an embodiment, the GPU 122 is a processor ofanother computing entity. In an embodiment, the GPU 122 is a processor(e.g., a general use processing unit) and the executable code 120includes optimized linear algebra kernels that, if executed by the GPUat least in part, controls an autonomous vehicle such as a drone,self-driving vehicle such as an automobile, unmanned aerial vehicle(UAV), and more. In an embodiment, the executable code 120 implements amachine-learning or deep learning software program or algorithm thatincludes the performance of one or more matrix multiplication operationswhich are accelerated through execution on the GPU 122.

FIG. 2 illustrates a computing environment 200 in which a program isimplemented according to techniques described here to generate optimizedexecutable code. In an embodiment, a directed acyclic graph 202 isgenerated from an input program file 204 and the DAG is used to generatea polyhedral IR 206 of the input program.

In an embodiment, the program is written in a domain-specific languagethat describes a high-level matrix multiplication operation. In anembodiment, the DSL is described based on or in the following manner:

Matrix A (1024 ,1024); Matrix B (1024 ,1024); C = A * B; CodeGen (FLOAT, {A, B, C}, {C}, sgemm );

In an embodiment, the operations above are encoded in an input programfile 204 and a configuration file can be specified that provides a setof parameters for determining how to process the input file to determinea schedule and, from the schedule, computer-executable instructionsaccording to a processor architecture. In an embodiment, a suitablesystem uses a parser to cover the high-level representation to a DAG202. In an embodiment, nodes of the DAG represent operators, variables,and other linear algebra primitives. In accordance with FIG. 2 and in anembodiment, the circles represent data matrices and the diamondscorrespond to operations. In an embodiment, the DAG is used to determinewhere intermediate results are needed and to calculate the size of thoseintermediates.

In an embodiment, the parser constructs that DAG and a table lookup ofoperations is used to convert the DAG to a set of equations inpolyhedral form as a polyhedral IR 206. As a point of comparison, anun-optimized CPU implementation of the above code, in an embodiment,might be:

for (i = 0; i < 1024 i ++){  for (j = 0; j < 1024; j ++){   C[i][j] = 0;  for (k = 0; k < 1024; k ++){    C[i][j] += A[i][k] * B[k][j]   }  } }

In an embodiment, a polyhedral compiler beings by considering theiteration space that represents the set of possible statement instances,such as the range of the indices i, j, and k in the range of0<=i,j,k<1024. In an embodiment, the polyhedral compiler then finds thedependencies among iterations—in this case, there is a dependencecarried by the k loop since each C[i][j] depends on the value in theprevious iteration. In an embodiment, the operation in the inner loopcomprises of loads A and B and fused multiply add (FMA) operations aswell as assorted loop control statements. In an embodiment, polyhedralmethods are used to perform various optimizations to reorder the loops,add tiling, apply fusion, spit the loops up and introduce shared memory,and more. In an embodiment, the code is reshaped to add more loops sothat outer block level loops which have no dependence and can be mappeddirectly to the parallel execution of GPU blocks, per warp loops, andper thread loops.

In an embodiment, the computation (e.g., matrix multiplication) is splitinto tiles having tilesizes specified by the configuration file (e.g.,8×8 tiles). In an embodiment, a tile having an 8×8 size is computed byreading 8 elements of a first matrix and 8 elements of a second matrixto combine them into 64 accumulated sums. In an embodiment, as the loopsget formed, the DSL does unrolling using the following code:

for each block ( ) {  // Copy a tile of A and B from global mem toshared mem .  // This is split into 2 phases .  // 1) global −>private_buffer ; 2) private_buffer −> shared  Load_A_Block (0); Load_B_Block (0);  Store_A_Block ( );  Store_B_Block ( ); _(——)syncthreads ( );  // main loop  for (int k = 0; k < 127; k++) {  for each thread in warp {    // Prefetch data for next iteration fromglobal to private buffer    Load_A_Block (k +1);    Load_B_Block (k +1);   for (kk = 0; kk < 8; kk ++) {     // Copy a col of A and a row of Bfrom shared to private arrays     Load_A_Col ( );     Load_B_Row ( );    // Perform outer product     FMA (A_pvt , B_pvt , C_pvt );    }   _(——)syncthreads ( );    // Store prefetched data from buffer toshared mem    Store_A_Block ( );    Store_B_Block ( );   _(——)syncthreads ( );   }  } // end main loop processing tiles  //Compute last tile  for (kk = 0; kk < 8; kk ++) {   // Copy a col of Aand a row of B from shared to private arrays   Load_A_Col ( );  Load_B_Row ( );   // Perform outer product   FMA(A_pvt , B_pvt , C_pvt);  }  // write the output from private to global mem  for each warp inthe block {   for each thread in warp {    Store_C ( );   }  } }

In an embodiment, some polyhedral transformations are standard (e.g., inthe sense that they are processor-agnostic and apply to variousarchitectures) where as others make use of GPU specific features. In anembodiment, block sizes are specifiable by a user via a configurationfile, allowing for a future auto tuner. In an embodiment, a table drivenapproach is used to map DAGs to loop nests. In an embodiment, the codeis structured so that the outermost loops have no loop carrieddependence so that they can be mapped to GPU threadblocks. In anembodiment, shared memory is supported. In an embodiment, parts of inputmatrices are fetched from global memory so that loads and FMAs canoverlap In an embodiment, the code is swizzled to exploit hardwarehigh-bandwidth patterns.

As part of a fourth step, and in an embodiment, the computation furtherincludes two parts: outer chunk, and inner chunk (that is nested withinthe outer chunk) where outer chunk is distributed to differentthreadblocks, and the inner chunk is executed by each thread. In anembodiment, the inner chunk reads 8 values (e.g., as determined by thetilesize) from shared memory and computes the sums, which can be writtenusing a series of loads, FMAs, and assorted loop controls. In anembodiment, the code is implemented in the following manner:

for (k = 0; k < 8 k ++){  for (i = 0; i < 8; i ++){   for (j = 0; j < 8;j ++){    C[i][j] += A[i][k] * B[k][j]   }  } }

This code is, in an embodiment, computationally more efficient than theun-optimized CPU implementation described above. In an embodiment,unlike the outer chunk where the outermost loops should be parallel sothat they can be distributed, at the inner chunk, we speed this upexchanging the loops so that the dependence is on the outermost loop andthen to unroll the inner loops and the result is that at the outer levelwe get coarse-grained parallelism while we get fine-grain parallelism atthe inner loops.

In an embodiment, the system remove the loop overhead by fully unrollingthe code which results in a performance gain since the FMA operationscan be overlapped and the loop control overhead eliminated, butaccumulator values are kept in register to increase the amount ofunrolling. In an embodiment, the tilesize is adjusted in theconfiguration file to get the most unrolling without spilling. Formallythis transformation replaces the dot product inner loop with an outer(or as it is sometimes called tensor) product. Each k iteration does areduction over matrices. After the transform the code needs to load acolumn of A's and a row of B's followed by an outer product of A⊗Bgiving 16 loads for 8 FMA operations. Notice that there is no dependencewithin each k iteration. So any order of the FMA operations produces thesame numerical result but it turns out that some orders are faster thanothers.

In an embodiment, each set loads a set of a and b values from sharedmemory, wait for the loads to finish, and then applies generated FMAsequence. In an embodiment, if there were a lot of threads, hardwarewould fill the time waiting for loads to finish by running otherthreads. However, due to the amount of shared memory in use there maynot be enough threads to cover the latency. In an embodiment, ratherthan using 8 registers to hold A and B slices, double-bufferingtechniques may be two sets of 8 are used to hold two slices so that theycan be overlapped. In an embodiment, the a computer system 208 such asthose described in connection with FIG. 1 execute the steps described togenerate a DAG from the input program and then determine a polyhedralform from the DAG.

FIG. 3 illustrates a computing environment 300 showing an architectureof the DSL compilation process, in accordance with one embodiment. In anembodiment, the frontends 302 each accept input programs from the uservia an interface 304 and converts them into an equivalent graph internalrepresentation 306 which may be represented as directed nodes withoperators and operands as the nodes. In an embodiment the graph IR 306is provided to the DSL library 308 which is software that exposes a setof C APIs allowing other libraries to directly link to diesel togenerate optimized GPU kernels. In an embodiment, the optimized GPUkernels are optimized to run on one more specific microarchitectures310, each of which may support different types of hardwareoptimizations.

FIG. 4 shows an illustrative example of a process 400 for generatingoptimized executable code, in accordance with an embodiment. Some or allof the process 400 (or any other processes described herein, orvariations and/or combinations thereof) may be performed under thecontrol of one or more computer systems configured with executableinstructions and may be implemented as code (e.g., executableinstructions, one or more computer programs, or one or moreapplications) executing collectively on one or more processors, byhardware, software, or combinations thereof. In an embodiment, the codeis stored on a computer-readable storage medium in the form of acomputer program comprising a plurality of instructions executable byone or more processors. In an embodiment, the computer-readable storagemedium is a non-transitory computer-readable medium. In an embodiment,at least some of the instructions usable to perform the process 400 arenot stored solely using transitory signals (e.g., a propagatingtransient electric or electromagnetic transmission). A non-transitorycomputer-readable medium, in an embodiment, includes non-transitory datastorage circuitry (e.g., buffers, caches, and queues) withintransceivers of transitory signals. In an embodiment, the process 400 isperformed by any suitable system, such as a compute resource inaccordance with those described in connection with FIG. 1 .

A computer system, in an embodiment, performs at least a portion of aprocess for generating executable code from an input expression byexecuting computer instructions for obtaining 402 a set of instructionsfor executing a computer program. In an embodiment, a computer programrefers to a set of instructions encoded in a domain-specific language.In an embodiment, the computer program is encoded as a user-generatedinput program file. The system, in an embodiment, parses an input fileprovided by a user and builds an expression. In an embodiment, thesystem includes code for composing 404 a DAG from the computer systeminstructions where the nodes of the DAG represent matrix objects,denoting user specified input/output matrices, or intermediate matricesof various operations. In an embodiment, each node has an associatedtype which may be one of the following: an arithmetic operation; anassignment operation; a transposition operation; and input and outputmatrix. In an embodiment, each node of the DAG encodes metadatacorresponding to the input such as its data size. In an embodiment, thesize of a matrix object is encoded as the size of the data object alongeach dimension (e.g., for a two-dimensional matrix, the row and columnsizes are stored as the data size vector). In an embodiment, the size ofinput matrices are specified by the user, and the size of intermediateand output matrices are calculated. The data sizes of other DAG typesmay be derived by using applicable mathematical rules, such as in thefollowing manner:

[

e ₁

]_(m,n) +[

e ₂

]_(m,n) →[

e ₁

+

e ₂

]_(m,n)

[

e ₁

]_(m,n) −[

e ₂

]_(m,n) →[

e ₁

−

e ₂

]_(m,n)

[

e ₁

]_(m,p) *[

e ₂

]_(n,p) →[

e ₁

*

e ₂

]_(m,n)

[

e ₁

]^(T) _(m,n) →[

e ₁

^(T)]_(m,n)

[

M

] ^(T) _(m,n) =[

e ₈

]_(m,n) →[

M

] _(m,n)

In an embodiment, the DAG is built from an input program and any commonexpressions are eliminated. in an embodiment, the expression DAG isbuilt to propagate information between the nodes to extract polyhedralrepresentation of the computation to be performed and, to assist in theprocess, some or all node N are associated with an integer set, calleddataset (S(N)), and an affine map, called datamap (M(N)), as part ofconstructing the DAG. In an embodiment, a dataset is a convex polyhedralrepresentation of data elements of a node. In an embodiment, a datamapis an affine relation that captures data-layout information of adataset. In an embodiment, a matrix A has a dataset {A[i,j]:0<=i,j<10}and a matrix C has dataset {C[i,j]:0<=i,j<10} and the expression C=Asemantically indicates that matrix C is an alias to A and, hence, bothmatrices A and C can share the same memory space (e.g., same physicalmemory address space). This information is captured using the datamap{C[i,j]→A[i,j]:}, in an embodiment, which indicates that dataset of C isan identity map to dataset of A.

The system, in an embodiment, performs steps for determining 406 apolyhedral model of the DAG by extracts the polyhedral model for theinput program by building iteration domains, access functions, and datadependence relations corresponding to the expressions, in accordancewith one embodiment. In an embodiment, the iteration domain and datadependence information are utilized to determine a valid schedule forthe computation, whereas access functions are utilized for codegeneration. In an embodiment, iteration domains—which mathematicallycapture the operation of instances of expressions—are constructed usingdomain specific knowledge such as a set of domain specific rules. In anembodiment, iteration domains of different nodes live in different setspaces and are identified by their space names.

Access maps relate points in iteration domain to data sets and, in anembodiment, each operation is associated with two access maps—a readaccess map and a write access map. In an embodiment, read maps captureread information of an operation by relating its iteration domain pointsto datasets of its operands, whereas write maps capture writeinformation by relating iteration domain points to its own data set. Inan embodiment, aliases between data sets for the operands are resolvedbased on the datamaps.

In an embodiment and as part of building read-after-write datadependencies, intra-domain data dependencies are directly constructedbased on domain specific knowledge; the intra-domain data dependenciesare obtained by constructing live-maps that capture last-writerinformation. In an embodiment, the iteration domain for the expressionA*B discussed in connection with FIG. 1 is given by{D_(1[i,j,k]):0≤i<1024 ∧0≤j<1024 ∧0≤k<1024}, and its write access map is{D_(1[i,j,k])→M_1[i,j]:}. In an embodiment, domain specific knowledge fmatrix-matrix multiplication is used to determine the last writers ofD_1 to M1 are {D 1[i,j,k]: 0<=i<1024 ∧0<=j<1024∧k=1023} and that thelive-map for A*B is given by {D 1[i,j,k]->M 1[i,j]: 0<=i<1024∧0<=j<1024∧k=1023}.

Read-after-read dependencies indicate data reuse between points in aniteration domain, in an embodiment, and this information is taken intoaccount while computing schedules to improve data locality. In anembodiment, given a union of read access relations (R) of all theoperations, RAR dependencies (R_(R)) can be computed as follows:R_(R):=R⁻¹∘R.

The system, in an embodiment, is further configured with instructionsfor using 408 the polyhedral model to determine an optimized or optimalschedule. In an embodiment, the system determines an improved schedulethat reduces the runtime but is not necessarily optimal (e.g., asub-optimal schedule improvement). In an embodiment, once the iterationdomains and data dependencies are collected, an initial affine scheduleis computed. In an embodiment, the initial affine schedule maximizesdata reuse. In an embodiment, the DSL utilizes a scheduling algorithm(such as PLuTo's scheduling algorithm described in “A PracticalAutomatic Polyhedral Program Optimization System” by Bondhugula, et al.)to determine the optimal schedule. In an embodiment, PLuTo's costfunction is used to maximize data reuse by generating a schedule thatincludes loop fusion transformations, is amenable to tiling, and otherfeatures which may be supported according to a configuration file. In anembodiment, loop fusion refers to a routine for replacing code formultiple loops with a single loop.

In an embodiment, to generate a SPMD code for GPUs, at least oneoutermost loop of the computed schedule is parallel. PLuTo, in anembodiment, generates a schedule that is outer-parallel/inner-sequentialby default, and for each individual operation that is supported by theDSL, there exists at least one affine schedule with parallel outerloops. In an embodiment, maximally fusing the loops (e.g., as part ofimproving data locality) does not necessarily lead to an optimalscheduling as it may result in a schedule with no outer parallelloops—accordingly, the system may allow for application of fusion whenit is detected that the fused schedule is not fully sequential toprovide an initial schedule with a sequence of loop nests, each with atleast one outer parallel loop. Consider the initial schedule computedfor the expression E=(A+B)*(C+D) in an embodiment and a sequenceincluding two loop nests, where the first loop nest computes theexpression C+D, while the remaining addition and multiplicationoperations are fused into a single loop nest.

Once an initial affine schedule is determined, in an embodiment, asequence of transformations and/or optimizations are applied, each ofwhich adds further improvements to the schedule. In an embodiment, thetransformations and optimizations to apply are specifiable in aconfiguration file. In an embodiment, the transformations are applied ata polyhedral level. In an embodiment, the transformations are directlyhandled during code generation. In an embodiment, one or more of thefollowing transformations/operations are supported: tiling; insertion ofdata copy-in/copy-out statements; block-cyclic distribution of loops;double buffering; array transposition during copy; array padding;strip-mining; swizzling; loop unrolling; and any suitable combinationthereof.

In an embodiment, tiling is utilized as an optimization to improve datalocality and/or expose parallelism to workload distributions. In anembodiment, the schedule for the matrix-matrix multiplication C=A*B isused to generate a transformed schedule in the following manner:

Initial Schedule for (i=0; i <1024; i++)  for (j=0; j <1024; j++)   for(k=0; k <1024; k++)    C[i][j] += A[i][k] * B[k][j]; TransformedSchedule After Single Level Of Tiling With Tilesize <16, 16, 16> for(i=0; i <64; i++)  for (j=0; j <64; j++)   for (k=0; k <64; k++)    for(ii =0; ii <16; ii ++)     for (jj =0; jj <16; jj ++)      for (kk =0;kk <16; kk ++)       C[16* i+ii ][16* j+jj] += A[16* i+ii ][16* k+kk] *B[16* k+kk ][16* j+jj ];

As described above, the outer two parallel inter-tile loops are suitablefor distribution among different threadblocks in a process such as aGPU, in an embodiment. Corresponding kernel functions obtained afterdistributing inter-tile iterations among threadblocks and inter-tileiterations among threads are, in an embodiment, described in thefollowing manner:

int bid_y = blockIdx.y; int bid_x = blockIdx.x; int tid_y = threadIdx.y;int tid_x = threadIdx.x; for (k=0; k <64; k++)  for (kk =0; kk <16; kk++)   C[16* bid_y + tid_y ][16* bid_x + tid_x ] +=    A[16* bid_y +tid_y ][16* k+kk] *    B[16* k+kk ][16* bid_x + tid_x ];

Furthermore, the code has data reuse along multiple dimensions such thatdata shared between different inter-tile iterations can be reusedthrough shared memory, in accordance with one embodiment.

In an embodiment, up to three levels of tiling are applied—block level,warp level, and thread level. In an embodiment, block level tilingallows distributions of iterations among threadblocks and reuse of datathrough shared memory. In an embodiment, warp level tiling is performedto aid application of warp level optimizations, as described in greaterdetail below. In an embodiment, thread level tiling distributesiterations among threads within a block, and also allows data reusethrough registers.

Data reuse within a kernel occurs, in an embodiment, based on either RARdependencies (e.g., same data is read multiple times within a kernel) orRAW dependencies (e.g., data is both computed and read within the samekernel). In an embodiment, since input expressions are enforced to be insingle-assignment form, WAR/WAW dependencies are within a singlestatement and are implicitly covered by RAW dependencies. In anembodiment, input data access multiple times within a kernel are copiedinto shared memory prior to the actual computation and read from sharedmemory. In an embodiment, a temporary space in the shared memory isutilized to reuse data that is both written and read within the samekernel. In an embodiment and at the end of the computation, newlycomputed live-out data is copied out from shared memory to globalmemory. In an embodiment, the process involves the following steps:identifying reused accesses; determining data-tile sizes for reuseddata-sets; inserting new iteration domain and schedule forcopy-in/copy-out code into previously computed schedule; and updatingaccess functions of the statements.

In an embodiment, identifying inter-thread data reuse, in an embodiment,schedule-to-access map M for statements that are scheduled togetherwithin a device function are collected and, to compute inter-threadreuse, the innermost dimensions corresponding to schedules private toeach thread are projected out from Domain(M). In an embodiment, thepresence of inter-thread data reuse is checked by testing whether theresulting map R is mathematically injective. In an embodiment,inter-thread data reuse exists if and only if R is not injective basedon principles of mathematical set theory. In an embodiment, data reusedthrough registers is determined based on the techniques described above.

In an embodiment, tilesizes are determined once a union of reused dataaccess maps, R, are collected in the following manner: the tile sizes ofshared arrays are determined for each map m [e] R by checking if therange of m is a convex set and m is ignored if it is not convex. In anembodiment, outer)inter-tile_dimensions corresponding to thethreadblocks in the domain of m are parameterized. In an embodiment,this provides us with an access map for a single data-set accessed byany single parameterized threadblock. In an embodiment and for eachdimension of the data-set, the system then checks whether a constantlength for that dimension can be found from the constraints governingthe Range(m) and if such a constant length cannot be found the data setis ignored, but if it is, offsets to the array accesses, in terms of thethreadblock parameters, is extracted and stored for computing sharedarray to a global array access map. In an embodiment, if the threadaccess arrays in a strided pattern (e.g., with block-cyclicdistribution) their stride lengths are extracted from the constraints.

For inserting copy-in/copy-out statement schedules and in an embodiment,for each data-set s for which a constant tilesize is found, a newiteration domain denoting the copy operation is constructed. In anembodiment, the constructed iteration domain is a basic convex integerset with sizes along each dimension equal to reused array sizes. In anembodiment, a schedule to copy the data is manually constructed suchthat global array accesses are coalesced. In an embodiment, array offsetand stride information collected (e.g., previously) are utilized tobuild schedule-to-global array access maps.

To update access functions, the system, in an embodiment, extractsglobal-to-shared array maps once the copy-schedules and their accessmaps are built. In an embodiment, this information is utilized inconnection with replacing global array access maps of various statementswith shared array access maps.

In an embodiment, tiling, as described in greater detail below, blockdistributes loop iterations and results in a strided access pattern bythreads. In an embodiment, with cyclic distribution, consecutive threadsare configured to operate on consecutive data elements, therebyeliminating bank conflicts. In an embodiment, loads and stores arevectorized through block-cyclic distribution by distributing loopiterations among T threads (along each direction) with blocksize B,tiling loops to be distributed with tilesize T×B, and tiling theresulting intra-tile loop again with tilesize B, and then permutatingthe first level inter- and intra-tile loops. In an embodiment, eachiteration of the outermost band of the resulting loop nest is assignedto be executed by a single thread. In an embodiment, a block-cyclicallydistributed version of single-precision general matrix multiply (SGEMM)code with blocksize of 4, allowing conflict free 128-bit vector loads ofelements from shared arrays Bs to registers is implemented in thefollowing manner:

for (int i=0; i <=1; i +=1)  for (int j=0; j <=3; j +=1)   Ap[4*i+j ][0]= As [32* wid_y +4* lid_y +16* i+j][kk]; for (int i = 0; i <= 1; i += 1) for (int j = 0; j <= 3; j += 1)   Bp[0][4* i+j] = Bs[kk ][64* wid_x +4*lid_x +32* i+j]; for (int i=0; i <=1; i +=1)  for(int j=0; j <=1; j +=1)  for (int ii =0; ii <=3; ii +=1)    for (int jj =0; jj <=3; jj +=1)    Cp[4* i+ii ][4* j+jj] += Ap[4* i+ii ][0] * Bp[0][4* j+jj ];

In an embodiment, the following SGEMM kernel code implements threelevels of tiling and utilizes shared memory and registers for datareuse:

int bid_y = blockIdx.y, bid_x = blockIdx.x; int tid_y = threadIdx.y,tid_x = threadIdx.x; int linearId = tid_y *16+ tid_x ; int warpId =linearId /32; int laneId = linearId %32; int wid_y = warpId /2, wid_x =warpId %2; int lid_y = laneId /8, lid_x = laneId %8; _(——)shared_(——)float As[128][8], Bs[8][128]; float Ap[8][1] , Bp[1][8] , Cp[8][8] ={0}; for (int k=0; k <=127; k +=1) {  // Global −> Shared copy  for (inti=0; i <=3; i +=1) {   As[32* wid_y +16* wid_x + lid_y +4* i][ lid_x] =A[...];   Bs[2* wid_y + wid_x ][8* lid_y + lid_x +32* i] = B[...];  } _(——)syncthreads ( );  for (int kk = 0; kk <= 7; kk += 1) {   // Shared−> Reg copy   for (int i=0; i <=7; i +=1)   Ap[i][0] = As[32* wid_y +8*lid_y +i][ kk ];   for (int i=0; i <=7; i +=1)   Bp[0][i] = Bs[kk ][64*wid_x +8* lid_x +i];   // Compute   for (int i=0; i <=7; i +=1)   for(int j=0; j <=7; j +=1)    Cp[i][j] += Ap[i][0] * Bp[0][ j];  } _(——)syncthreads ( ); } // Reg −> Global copy for (int i=0; i <=7; i+=1) {  for(int j=0; j <=7; j +=1) {   C[...] = Cp[i][j];  } }

In an embodiment, double buffering techniques for improving overlapbetween computations and data-movement at the cost of using additionshared memory is implemented wherein each threadblock loads a tile ofdata of the input array from global to shared memory and performscomputations by reading data from shared memory. In an embodiment, alldata required by a threadblock is loaded before starting the computationby making a _syncthreads( ) call between the data transfer and thecomputation loop. In an embodiment, the computation a _syncthreads( )function call is inserted after the computation loop and before new datais overwritten to the shared memory arrays. In an embodiment, adouble-buffering implementation requires additional shared memory to beallocated and new data needed by the next iteration is loaded into thisbuffering space so that while some warps are still performingcomputational work, the remaining warps can safely load the data to thebuffer space, thereby reducing or eliminating the need forsynchronization at the end of the computation loop.

The DSL determines, in an embodiment, a set of loops and arrays that areappropriate for double buffering such that for a given loop l, all ofthe arrays that are copied in from global to shared memory in l aremarked as double-buffered so that once a set of double-buffered arraysare collected, sizes of arrays marked as double-buffers are doubled andtheir corresponding _syncthreads( ) calls are replaced with code toupdate the pointer to the right position in the buffer. In anembodiment, double-buffered version of SGEMM code is based on orimplemented in the following manner:

_(——)shared_(——) float As[2][128][8], Bs[2][8][128]; for (int k=0; k<=127; k +=1) {  // Global −> Shared copy  for (int i=0; i <=3; i +=1) {  As[buf][32* wid_y +16* wid_x + lid_y +4* i][lid_x] = A[...];  Bs[buf][2* wid_y + wid_x ][8* lid_y + lid_x +32* i] = B[...];  } _(——)syncthreads ( );  for (int kk = 0; kk <= 7; kk += 1) {   // Shared−> Reg copy   for (int i=0; i <=1; i +=1)    for (int j=0; j <=3; j +=1)    Ap[4* i+j ][0] = As[buf][32* wid_y +4* lid_y +16* i+j][kk];   for(int i = 0; i <= 1; i += 1)    for (int j = 0; j <= 3; j += 1)    Bp[0][4* i+j] = Bs[buf][kk][64* wid_x +4* lid_x +32* i+j];   //Compute   for (int i=0; i <=1; i +=1)    for (int j=0; j <=1; j +=1)    for (int ii =0; ii <=3; ii +=1)      for (int jj =0; jj <=3; jj +=1)      Cp[4* i+ii ][4* j+jj] += Ap[4* i+ii ][0] * Bp[0][4* j+jj ];  } buf = (buf ? 0 : 1); }

In an embodiment, an optimization based on array transposition duringcopying is implemented. In an embodiment, a shared memory array isaccessed with a stride by different threads in a warp which leads to thepossibility of shared memory bank conflicts. In order to minimize orreduce such conflicts, the DSL performs, in an embodiment, a check todetermine if is beneficial to transpose an array as it is copied fromglobal to shared memory. In an embodiment, the check comprises checkingthe access function of fastest varying dimension of to see if it is afunction of thread indices and, if so, no changes are made, otherwiseaccess function of the inner dimension depends on thread index, then thearray is marked to be transposed while it is being copied from globalmemory. In an embodiment, sample code below illustrates generated codewhere the array A is transposed as it is being copied to shared arrayAs, so that any later uses of A are free of shared bank conflicts.

In an embodiment, array padding involves an array that is transposedwhile it is stored, which introduces shared memory bank conflicts duringstore operation since the access function of inner dimension as it isbeing stored is a function of thread index. In an embodiment, suchconflicts are prevented by padding arrays with potential conflicts withadditional elements. In an embodiment, padding sizes are calculated asfollows: first checking if the array access stride is a multiple ofshared memory bank size and, if so, determining the number ofconsecutive banks n that are accessed by threads from a single warp, andpadding the leading dimension of the array with n words.

In an embodiment, loops that copy in/out data from global to sharedmemory are strip-mined so that the copy operation is translated tovector loads/stores in the final executable. In an embodiment, stripmining is possible when both load and store addresses of consecutiveiterations of a copy statement are to consecutive memory locations. Inan embodiment, any arrays that are transposed on-the-fly during copy arenot copied as vector loads.

In an embodiment, the following code is utilized:

_(——)shared_(——) float As[2][8][128], Bs[2][8][128]; for (int k=0; k<=127; k +=1) {  // Global −> Shared copy  for (int i=0; i <=3; i +=1) {  As[buf ][lid_x ][32* wid_y +16* wid_x + lid_y +4* i]    = A[...];  Bs[ buf ][2* wid_y + wid_x ][8* lid_y + lid_x +32* i]    = B[128*bid_x +2048* wid_y +1024* wid_x+8* lid_y + lid_x +8192* k +32* i];  } _(——)syncthreads ( );  for (int kk = 0; kk <= 7; kk += 1) {   // Shared−> Reg copy   for (int i=0; i <=1; i += 1)    for (int j=0; j <=3; j+= 1)     Ap[4* i+j ][0] = As[buf][kk][32* wid_y +4* lid_y +16* i+j]; for (int i = 0; i <= 1; i += 1)   for (int j = 0; j <= 3; j += 1)   Bp[0][4* i+j] = Bs[buf][ kk ][64* wid_x +4* lid_x +32* i+j];  //Compute  for (int i = 0; i <=1; i +=1)   for (int j = 0; j <=1; j +=1)   for (int ii = 0; ii <=3; ii +=1)     for (int jj = 0; jj <=3; jj +=1)     Cp[4* i+ii ][4* j+jj] += Ap[4* i+ii ][0] * Bp[0][4* j+jj ];  }  buf= (buf ? 0 : 1); }

In an embodiment, the code above shows the generated code where thearray A is transposed as it is copied to shared array As, so that anylater uses of A are free of shared bank conflicts. In an embodiment, theaccess As[buf][lid x][32*wid y+16*wid x+lid y+4*i] shown above in whichconsecutive threads access elements at a stride of 128, each warp copiesfour columns of data from global to shared memory, making each warpaccess data from four consecutive bank, and the array As is padded with4 words to prevent conflicts and the padded array As is declared asshared float As[2][8][128+4]. In the copy operation Bs[buf][2*wid y+widx][8*lid y+lid x+32*i]=B[128*bid x+2048*wid y+1024*wid x+8*lid y+lidx+8192*k+32*i]; consecutive threads load consecutive words from globalmemory and store to consecutive locations in shared memory.

In an embodiment, access pattern to input arrays of linear algebracomputations are regular and multiple threads require same data, such asin case of matrix-matrix multiplication, different threads that computedifferent columns of a single row need to read same element of array A,and different threads that compute different rows of a single columnread same element of array B. In an embodiment, different GPUarchitectures have support for local broadcast of data between threadsaccessing same elements from shared memory, if their thread ids matchcertain bit pattern, thus improving shared memory load bandwidth. In anembodiment, this bit pattern, referred to as a swizzling pattern, ishardware dependent. In an embodiment, instead of consecutive lanes in awarp computing consecutive iterations of computation, the lanes arepermuted/swizzled such that their read accesses match the swizzlingpattern of the architecture. In an embodiment, these patterns arerepresented as affine functions and once an affine representation of theswizzling pattern is constructed from hardware specific information, thepermutation affine map is applied to the already computed schedule toobtain a schedule optimized with swizzling.

In an embodiment, the system performs steps for generating 410executable code from the schedule. In an embodiment, once optimalschedules for computation statements, and various copy statements havebeen found, CUDA code corresponding to the schedule is generated and ISLprovides a sequence of loop nests corresponding to the schedule. In anembodiment, outermost parallel band of the generated code is distributedamong threadblocks, and the next two levels of parallelbands—corresponding to warps and lanes—are distributed among threads.

In an embodiment, the optimized executable code corresponding to thematrix-matrix multiplication expression C=A*B; (without loop unrollingperformed) is as follows:

_(——)global_(——) void ladsl_kern0 ( const float * A, const float * B,float * _(——)restrict_(——) C) {  int bid_y = blockIdx .y, bid_x =blockIdx .x;  int tid_y = threadIdx .y, tid_x = threadIdx .x;  intlinearId = tid_y *16+ tid_x ;  int warpId = linearId /32;  int laneId =linearId %32;  int wid_y = warpId /2, wid_x = warpId %2;  int lid_y =laneId /8, lid_x = laneId %8;  _(——)shared_(——) _(——)align_(——) (8)float As [2][8][128+4];  _(——)shared_(——) _(——)align_(——) (8) float Bs[2][8][128+0];  float Ap [8][1] , Bp [1][8] , Cp [8][8] = {-0};  int buf= 0;  for (int k =0; k <=127; k +=1) {   // Global −> Shared copy   for(int i =0; i <=3; i +=1)    As[buf ][ lid_x ][32* wid_y +16* wid_x +lid_y +4* i]     = A [131072* bid_y +32768* wid_y +16384* wid_x    +1024* lid_y + lid_x +8*k +4096* i];   for (int i =0; i <=1; i +=1)   for (int j =0; j <=1; j +=1)     Bs[ buf ][2* wid_y + wid_x ][16*lid_y +2* lid_x +64* i+j]      = B [128* bid_x +2048* wid_y +1024* wid_x     +16* lid_y +2* lid_x +8192* k +64* i+j];   _(——)syncthreads ( );  for (int kk = 0; kk <= 7; kk += 1) {    // Shared −> Reg copy    for(int i =0; i <=1; i +=1)     for (int j =0; j <=3; j +=1)      Ap [4*i+j][0] = As[buf ][ kk ][4*( lid_x %2)+32* wid_y       +16* i+j +8*((8*lid_y + lid_x )/16)];    for (int i = 0; i <= 1; i += 1)     for (int j= 0; j <= 3; j += 1)      Bp [0][4* i+j] = Bs[buf ][ kk ][2*((8* lid_y +lid_x )%16)       −2*( lid_x %2)+64* wid_x +32* i+j];    // Compute   for (int i =0; i <=1; i +=1)     for (int j =0; j <=1; j +=1)     for (int ii =0; ii <=3; ii +=1)       for (int jj =0; jj <=3; jj+=1)        Cp [4*i+ii ][4* j+jj] +=         Ap [4*i+ii ][0] * Bp [0][4*j+jj ];   }   buf = ( buf ? 0 : 1);  }  // Registers −> Global copy  for(int i = 0; i <= 1; i += 1)   for (int j = 0; j <= 1; j += 1)    for(int ii = 0; ii <= 3; ii += 1)     for (int jj = 0; jj <= 3; jj += 1)     C [4094*( lid_x %2) −510*((8* lid_y + lid_x )%16)       +131072*bid_y +128* bid_x +32768* wid_y       +64* wid_x+4096* lid_y +512* lid_x      +16384* i +32* j +1024* ii+jj]         = Cp [4* i+ii ][4* j+jj ];}

FIG. 5 illustrates a parallel processing unit (“PPU”) 500, in accordancewith one embodiment. In an embodiment, the PPU 500 is configured withmachine-readable code that, if executed by the PPU, causes the PPU toperform some or all of processes and techniques described throughoutthis disclosure. In an embodiment, the PPU 500 is a multi-threadedprocessor that is implemented on one or more integrated circuit devicesand that utilizes multithreading as a latency-hiding technique designedto process computer-readable instructions (also referred to asmachine-readable instructions or simply instructions) on multiplethreads in parallel. In an embodiment, a thread refers to a thread ofexecution and is an instantiation of a set of instructions configured tobe executed by the PPU 500. In an embodiment, the PPU 500 is a graphicsprocessing unit (“GPU”) configured to implement a graphics renderingpipeline for processing three-dimensional (“3D”) graphics data in orderto generate two-dimensional (“2D”) image data for display on a displaydevice such as a liquid crystal display (LCD) device. In an embodiment,the PPU 500 is utilized for perform computations such as linear algebraoperations and machine-learning operations. FIG. 5 illustrates anexample parallel processor for illustrative purposes only and should beconstrued as a non-limiting example of processor architecturescontemplated within the scope of this disclosure and that any suitableprocessor may be employed to supplement and/or substitute for the same.

In an embodiment, one or more PPUs are configured to accelerate HighPerformance Computing (“HPC”), data center, and machine learningapplications. In an embodiment, the PPU 500 is configured to acceleratedeep learning systems and applications including the followingnon-limiting examples: autonomous vehicle platforms, deep learning,high-accuracy speech, image, text recognition systems, intelligent videoanalytics, molecular simulations, drug discovery, disease diagnosis,weather forecasting, big data analytics, astronomy, molecular dynamicssimulation, financial modeling, robotics, factory automation, real-timelanguage translation, online search optimizations, and personalized userrecommendations, and more.

In an embodiment, the PPU 500 includes an Input/Output (“I/O”) unit 505,a front-end unit 510, a scheduler unit 512, a work distribution unit514, a hub 516, a crossbar (“Xbar”) 520, one or more general processingclusters (“GPCs”) 518, and one or more partition units 522. In anembodiment, the PPU 500 is connected to a host processor or other PPUs500 via one or more high-speed GPU interconnects 58. In an embodiment,the PPU 500 is connected to a host processor or other peripheral devicesvia an interconnect 502. In an embodiment, the PPU 500 is connected to alocal memory comprising one or more memory devices 504. In anembodiment, the local memory comprises one or more dynamic random accessmemory (“DRAM”) devices. In an embodiment, the one or more DRAM devicesare configured and/or configurable as high-bandwidth memory (“HBM”)subsystems, with multiple DRAM dies stacked within each device.

The high-speed GPU interconnect 58 may refer to a wire-based multi-lanecommunications link that is used by systems to scale and include one ormore PPUs 500 combined with one or more CPUs, supports cache coherencebetween the PPUs 500 and CPUs, and CPU mastering. In an embodiment, dataand/or commands are transmitted by the high-speed GPU interconnect 58through the hub 516 to/from other units of the PPU 500 such as one ormore copy engines, video encoders, video decoders, power managementunits, and other components which may not be explicitly illustrated inFIG. 5 .

In an embodiment, the I/O unit 505 is configured to transmit and receivecommunications (e.g., commands, data) from a host processor (notillustrated in FIG. 5 ) over the system bus 502. In an embodiment, theI/O unit 505 communicates with the host processor directly via thesystem bus 502 or through one or more intermediate devices such as amemory bridge. In an embodiment, the I/O unit 505 may communicate withone or more other processors, such as one or more the PPUs 500 via thesystem bus 502. In an embodiment, the I/O unit 505 implements aPeripheral Component Interconnect Express (“PCIe”) interface forcommunications over a PCIe bus. In an embodiment, the I/O unit 505implements interfaces for communicating with external devices.

In an embodiment, the I/O unit 505 decodes packets received via thesystem bus 502. In an embodiment, at least some packets representcommands configured to cause the PPU 500 to perform various operations.In an embodiment, the I/O unit 505 transmits the decoded commands tovarious other units of the PPU 500 as specified by the commands. In anembodiment, commands are transmitted to the front-end unit 510 and/ortransmitted to the hub 516 or other units of the PPU 500 such as one ormore copy engines, a video encoder, a video decoder, a power managementunit, etc. (not explicitly illustrated in FIG. 5 ). In an embodiment,the I/O unit 505 is configured to route communications between and amongthe various logical units of the PPU 500.

In an embodiment, a program executed by the host processor encodes acommand stream in a buffer that provides workloads to the PPU 500 forprocessing. In an embodiment, a workload comprises instructions and datato be processed by those instructions. In an embodiment, the buffer is aregion in a memory that is accessible (e.g., read/write) by both thehost processor and the PPU 500—the host interface unit may be configuredto access the buffer in a system memory connected to the system bus 502via memory requests transmitted over the system bus 502 by the I/O unit505. In an embodiment, the host processor writes the command stream tothe buffer and then transmits a pointer to the start of the commandstream to the PPU 500 such that the front-end unit 510 receives pointersto one or more command streams and manages the one or more streams,reading commands from the streams and forwarding commands to the variousunits of the PPU 500.

In an embodiment, the front-end unit 510 is coupled to a scheduler unit512 that configures the various GPCs 518 to process tasks defined by theone or more streams. In an embodiment, the scheduler unit 512 isconfigured to track state information related to the various tasksmanaged by the scheduler unit 512 where the state information mayindicate which GPC 518 a task is assigned to, whether the task is activeor inactive, a priority level associated with the task, and so forth. Inan embodiment, the scheduler unit 512 manages the execution of aplurality of tasks on the one or more GPCs 518.

In an embodiment, the scheduler unit 512 is coupled to a workdistribution unit 514 that is configured to dispatch tasks for executionon the GPCs 518. In an embodiment, the work distribution unit 514 tracksa number of scheduled tasks received from the scheduler unit 512 and thework distribution unit 514 manages a pending task pool and an activetask pool for each of the GPCs 518. In an embodiment, the pending taskpool comprises a number of slots (e.g., 32 slots) that contain tasksassigned to be processed by a particular GPC 518; the active task poolmay comprise a number of slots (e.g., 4 slots) for tasks that areactively being processed by the GPCs 518 such that as a GPC 518completes the execution of a task, that task is evicted from the activetask pool for the GPC 518 and one of the other tasks from the pendingtask pool is selected and scheduled for execution on the GPC 518. In anembodiment, if an active task is idle on the GPC 518, such as whilewaiting for a data dependency to be resolved, then the active task isevicted from the GPC 518 and returned to the pending task pool whileanother task in the pending task pool is selected and scheduled forexecution on the GPC 518.

In an embodiment, the work distribution unit 514 communicates with theone or more GPCs 518 via XBar 520. In an embodiment, the XBar 520 is aninterconnect network that couples many of the units of the PPU 500 toother units of the PPU 500 and can be configured to couple the workdistribution unit 514 to a particular GPC 518. Although not shownexplicitly, one or more other units of the PPU 500 may also be connectedto the XBar 520 via the hub 516.

The tasks are managed by the scheduler unit 512 and dispatched to a GPC518 by the work distribution unit 514. The GPC 518 is configured toprocess the task and generate results. The results may be consumed byother tasks within the GPC 518, routed to a different GPC 518 via theXBar 520, or stored in the memory 504. The results can be written to thememory 504 via the partition units 522, which implement a memoryinterface for reading and writing data to/from the memory 504. Theresults can be transmitted to another PPU 504 or CPU via the high-speedGPU interconnect 58. In an embodiment, the PPU 500 includes a number Uof partition units 522 that is equal to the number of separate anddistinct memory devices 504 coupled to the PPU 500. A partition unit 522will be described in more detail below in conjunction with FIG. 7 .

In an embodiment, a host processor executes a driver kernel thatimplements an application programming interface (“API”) that enables oneor more applications executing on the host processor to scheduleoperations for execution on the PPU 500. In an embodiment, multiplecompute applications are simultaneously executed by the PPU 500 and thePPU 500 provides isolation, quality of service (“QoS”), and independentaddress spaces for the multiple compute applications. In an embodiment,an application generates instructions (e.g., in the form of API calls)that cause the driver kernel to generate one or more tasks for executionby the PPU 500 and the driver kernel outputs tasks to one or morestreams being processed by the PPU 500. In an embodiment, each taskcomprises one or more groups of related threads, which may be referredto as a warp. In an embodiment, a warp comprises a plurality of relatedthreads (e.g., 32 threads) that can be executed in parallel. In anembodiment, cooperating threads can refer to a plurality of threadsincluding instructions to perform the task and that exchange datathrough shared memory. Threads and cooperating threads are described inmore detail, in accordance with one embodiment, in conjunction with FIG.7A.

FIG. 6 illustrates a GPC 600 such as the GPC illustrated of the PPU 500of FIG. 5 , in accordance with one embodiment. In an embodiment, eachGPC 600 includes a number of hardware units for processing tasks andeach GPC 600 includes a pipeline manager 602, a pre-raster operationsunit (“PROP”) 604, a raster engine 608, a work distribution crossbar(“WDX”) 616, a memory management unit (“MMU”) 618, one or more DataProcessing Clusters (“DPCs”) 606, and any suitable combination of parts.It will be appreciated that the GPC 600 of FIG. 6 may include otherhardware units in lieu of or in addition to the units shown in FIG. 6 .

In an embodiment, the operation of the GPC 600 is controlled by thepipeline manager 602. The pipeline manager 602 manages the configurationof the one or more DPCs 606 for processing tasks allocated to the GPC600. In an embodiment, the pipeline manager 602 configures at least oneof the one or more DPCs 606 to implement at least a portion of agraphics rendering pipeline. In an embodiment, a DPC 606 is configuredto execute a vertex shader program on the programmable streamingmultiprocessor (“SM”) 614. The pipeline manager 602 is configured toroute packets received from a work distribution to the appropriatelogical units within the GPC 600, in an embodiment, and some packets maybe routed to fixed function hardware units in the PROP 604 and/or rasterengine 608 while other packets may be routed to the DPCs 606 forprocessing by the primitive engine 612 or the SM 614. In an embodiment,the pipeline manager 602 configures at least one of the one or more DPCs606 to implement a neural network model and/or a computing pipeline.

The PROP unit 604 is configured, in an embodiment, to route datagenerated by the raster engine 608 and the DPCs 606 to a RasterOperations (“ROP”) unit in the memory partition unit, described in moredetail above. In an embodiment, the PROP unit 604 is configured toperform optimizations for color blending, organize pixel data, performaddress translations, and more. The raster engine 608 includes a numberof fixed function hardware units configured to perform various rasteroperations, in an embodiment, and the raster engine 608 includes a setupengine, a coarse raster engine, a culling engine, a clipping engine, afine raster engine, a tile coalescing engine, and any suitablecombination thereof. The setup engine, in an embodiment, receivestransformed vertices and generates plane equations associated with thegeometric primitive defined by the vertices; the plane equations aretransmitted to the coarse raster engine to generate coverage information(e.g., an x, y coverage mask for a tile) for the primitive; the outputof the coarse raster engine is transmitted to the culling engine wherefragments associated with the primitive that fail a z-test are culled,and transmitted to a clipping engine where fragments lying outside aviewing frustum are clipped. IN an embodiment, the fragments thatsurvive clipping and culling are passed to the fine raster engine togenerate attributes for the pixel fragments based on the plane equationsgenerated by the setup engine. In an embodiment, the output of theraster engine 608 comprises fragments to be processed by any suitableentity such as by a fragment shader implemented within a DPC 606.

In an embodiment, each DPC 606 included in the GPC 600 comprises anM-Pipe Controller (“MPC”) 610; a primitive engine 612; one or more SMs614; and any suitable combination thereof. In an embodiment, the MPC 610controls the operation of the DPC 606, routing packets received from thepipeline manager 602 to the appropriate units in the DPC 606. In anembodiment, packets associated with a vertex are routed to the primitiveengine 612, which is configured to fetch vertex attributes associatedwith the vertex from memory; in contrast, packets associated with ashader program may be transmitted to the SM 614.

In an embodiment, the SM 614 comprises a programmable streamingprocessor that is configured to process tasks represented by a number ofthreads. In an embodiment, the SM 614 is multi-threaded and configuredto execute a plurality of threads (e.g., 32 threads) from a particulargroup of threads concurrently and implements a SIMD (Single-Instruction,Multiple-Data) architecture where each thread in a group of threads(e.g., a warp) is configured to process a different set of data based onthe same set of instructions. In an embodiment, all threads in the groupof threads execute the same instructions. In an embodiment, the SM 614implements a SIMT (Single-Instruction, Multiple Thread) architecturewherein each thread in a group of threads is configured to process adifferent set of data based on the same set of instructions, but whereindividual threads in the group of threads are allowed to diverge duringexecution. In an embodiment, a program counter, call stack, andexecution state is maintained for each warp, enabling concurrencybetween warps and serial execution within warps when threads within thewarp diverge. In another embodiment, a program counter, call stack, andexecution state is maintained for each individual thread, enabling equalconcurrency between all threads, within and between warps. In anembodiment, execution state is maintained for each individual thread andthreads executing the same instructions may be converged and executed inparallel for better efficiency. In an embodiment, the SM 614 isdescribed in more detail below.

In an embodiment, the MMU 618 provides an interface between the GPC 600and the memory partition unit and the MMU 618 provides translation ofvirtual addresses into physical addresses, memory protection, andarbitration of memory requests. In an embodiment, the MMU 618 providesone or more translation lookaside buffers (“TLBs”) for performingtranslation of virtual addresses into physical addresses in memory.

FIG. 7 illustrates a memory partition unit of a PPU, in accordance withone embodiment. In an embodiment, the memory partition unit 700 includesa Raster Operations (“ROP”) unit 702; a level two (“L2”) cache 704; amemory interface 706; and any suitable combination thereof. The memoryinterface 706 is coupled to the memory. Memory interface 706 mayimplement 32, 64, 128, 1024-bit data buses, or the like, for high-speeddata transfer. In an embodiment, the PPU incorporates U memoryinterfaces 706, one memory interface 706 per pair of partition units700, where each pair of partition units 700 is connected to acorresponding memory device. For example, PPU may be connected to up toY memory devices, such as high bandwidth memory stacks or graphicsdouble-data-rate, version 5, synchronous dynamic random access memory(“GDDR5 SDRAM”).

In an embodiment, the memory interface 706 implements an HBM2 memoryinterface and Y equals half U. In an embodiment, the HBM2 memory stacksare located on the same physical package as the PPU, providingsubstantial power and area savings compared with conventional GDDR5SDRAM systems. In an embodiment, each HBM2 stack includes four memorydies and Y equals 4, with HBM2 stack including two 128-bit channels perdie for a total of 8 channels and a data bus width of 1024 bits.

In an embodiment, the memory supports Single-Error CorrectingDouble-Error Detecting (“SECDED”) Error Correction Code (“ECC”) toprotect data. ECC provides higher reliability for compute applicationsthat are sensitive to data corruption. Reliability is especiallyimportant in large-scale cluster computing environments where PPUsprocess very large datasets and/or run applications for extendedperiods.

In an embodiment, the PPU implements a multi-level memory hierarchy. Inan embodiment, the memory partition unit 700 supports a unified memoryto provide a single unified virtual address space for CPU and PPUmemory, enabling data sharing between virtual memory systems. In anembodiment the frequency of accesses by a PPU to memory located on otherprocessors is trace to ensure that memory pages are moved to thephysical memory of the PPU that is accessing the pages more frequently.In an embodiment, the high-speed GPU interconnect 58 supports addresstranslation services allowing the PPU to directly access a CPU's pagetables and providing full access to CPU memory by the PPU.

In an embodiment, copy engines transfer data between multiple PPUs orbetween PPUs and CPUs. In an embodiment, the copy engines can generatepage faults for addresses that are not mapped into the page tables andthe memory partition unit 700 then services the page faults, mapping theaddresses into the page table, after which the copy engine performs thetransfer. In an embodiment, memory is pinned (i.e., non-pageable) formultiple copy engine operations between multiple processors,substantially reducing the available memory. In an embodiment, withhardware page faulting, addresses can be passed to the copy engineswithout regard as to whether the memory pages are resident, and the copyprocess is transparent.

Data from the memory of FIG. 5 or other system memory is fetched by thememory partition unit 700 and stored in the L2 cache 704, which islocated on-chip and is shared between the various GPCs, in accordancewith one embodiment. Each memory partition unit 700, in an embodiment,includes at least a portion of the L2 cache 660 associated with acorresponding memory device. In an embodiment, lower level caches areimplemented in various units within the GPCs. In an embodiment, each ofthe SMs 740 may implement a level one (“L1”) cache wherein the L1 cacheis private memory that is dedicated to a particular SM 740 and data fromthe L2 cache 704 is fetched and stored in each of the L1 caches forprocessing in the functional units of the SMs 740. In an embodiment, theL2 cache 704 is coupled to the memory interface 706 and the XBar 520.

The ROP unit 702 performs graphics raster operations related to pixelcolor, such as color compression, pixel blending, and more, in anembodiment. The ROP unit $$50, in an embodiment, implements depthtesting in conjunction with the raster engine 725, receiving a depth fora sample location associated with a pixel fragment from the cullingengine of the raster engine 725. In an embodiment, the depth is testedagainst a corresponding depth in a depth buffer for a sample locationassociated with the fragment. In an embodiment, if the fragment passesthe depth test for the sample location, then the ROP unit 702 updatesthe depth buffer and transmits a result of the depth test to the rasterengine 725. It will be appreciated that the number of partition units700 may be different than the number of GPCs and, therefore, each ROPunit 702 can, in an embodiment, be coupled to each of the GPCs. In anembodiment, the ROP unit 702 tracks packets received from the differentGPCs and determines which that a result generated by the ROP unit 702 isrouted to through the Xbar.

FIG. 8 illustrates a streaming multi-processor such as the streamingmulti-processor of FIG. 6 , in accordance with one embodiment. In anembodiment, the SM 800 includes: an instruction cache 802; one or morescheduler units 804; a register file 808; one or more processing cores810; one or more special function units (“SFUs”) 812; one or moreload/store units (“LSUs”) 814; an interconnect network 816; a sharedmemory/L1 cache 818; and any suitable combination thereof. In anembodiment, the work distribution unit dispatches tasks for execution onthe GPCs of the PPU and the each task is allocated to a particular DPCwithin a GPC and, if the task is associated with a shader program, thetask is allocated to an SM 800. In an embodiment, the scheduler unit 804receives the tasks from the work distribution unit and managesinstruction scheduling for one or more thread blocks assigned to the SM800. In an embodiment, the scheduler unit 804 schedules thread blocksfor execution as warps of parallel threads, wherein each thread block isallocated at least one warp. In an embodiment, each warp executesthreads. In an embodiment, the scheduler unit 804 manages a plurality ofdifferent thread blocks, allocating the warps to the different threadblocks and then dispatching instructions from the plurality of differentcooperative groups to the various functional units (e.g., cores 810,SFUs 812, and LSUs 814) during each clock cycle.

Cooperative Groups may refer to a programming model for organizinggroups of communicating threads that allows developers to express thegranularity at which threads are communicating, enabling the expressionof richer, more efficient parallel decompositions. In an embodiment,cooperative launch APIs support synchronization amongst thread blocksfor the execution of parallel algorithms. In an embodiment, applicationsof conventional programming models provide a single, simple constructfor synchronizing cooperating threads: a barrier across all threads of athread block (e.g., the syncthreads( ) function). However, programmerswould often like to define groups of threads at smaller than threadblock granularities and synchronize within the defined groups to enablegreater performance, design flexibility, and software reuse in the formof collective group-wide function interfaces. Cooperative Groups enablesprogrammers to define groups of threads explicitly at sub-block (i.e.,as small as a single thread) and multi-block granularities, and toperform collective operations such as synchronization on the threads ina cooperative group. The programming model supports clean compositionacross software boundaries, so that libraries and utility functions cansynchronize safely within their local context without having to makeassumptions about convergence. Cooperative Groups primitives enable newpatterns of cooperative parallelism, including producer-consumerparallelism, opportunistic parallelism, and global synchronizationacross an entire grid of thread blocks.

In an embodiment, a dispatch unit 806 is configured to transmitinstructions to one or more of the functional units and the schedulerunit 804 includes two dispatch units 806 that enable two differentinstructions from the same warp to be dispatched during each clockcycle. In an embodiment, each scheduler unit 804 includes a singledispatch unit 806 or additional dispatch units 806.

Each SM 800, in an embodiment, includes a register file 808 thatprovides a set of registers for the functional units of the SM 800. Inan embodiment, the register file 808 is divided between each of thefunctional units such that each functional unit is allocated a dedicatedportion of the register file 808. In an embodiment, the register file808 is divided between the different warps being executed by the SM 800and the register file 808 provides temporary storage for operandsconnected to the data paths of the functional units. In an embodiment,each SM 800 comprises a plurality of L processing cores 810. In anembodiment, the SM 800 includes a large number (e.g., 128 or more) ofdistinct processing cores 810. Each core 810, in an embodiment, includesa fully-pipelined, single-precision, double-precision, and/or mixedprecision processing unit that includes a floating point arithmeticlogic unit and an integer arithmetic logic unit. In an embodiment, thefloating point arithmetic logic units implement the IEEE 754-2008standard for floating point arithmetic. In an embodiment, the cores 810include 64 single-precision (32-bit) floating point cores, 64 integercores, 32 double-precision (64-bit) floating point cores, and 8 tensorcores.

Tensor cores are configured to perform matrix operations in accordancewith an embodiment. In an embodiment, one or more tensor cores areincluded in the cores 810. In an embodiment, the tensor cores areconfigured to perform deep learning matrix arithmetic, such asconvolution operations for neural network training and inferencing. Inan embodiment, each tensor core operates on a 4×4 matrix and performs amatrix multiply and accumulate operation D=A×B+C, where A, B, C, and Dare 4×4 matrices.

In an embodiment, the matrix multiply inputs A and B are 16-bit floatingpoint matrices and the accumulation matrices C and D are 16-bit floatingpoint or 32-bit floating point matrices. In an embodiment, the tensorcores operate on 16-bit floating point input data with 32-bit floatingpoint accumulation. In an embodiment, the 16-bit floating point multiplyrequires 64 operations and results in a full precision product that isthen accumulated using 32-bit floating point addition with the otherintermediate products for a 4×4×4 matrix multiply. Tensor cores are usedto perform much larger two-dimensional or higher dimensional matrixoperations, built up from these smaller elements, in an embodiment. Inan embodiment, an API, such as CUDA 9 C++ API, exposes specializedmatrix load, matrix multiply and accumulate, and matrix store operationsto efficiently use tensor cores from a CUDA-C++ program. In anembodiment, at the CUDA level, the warp-level interface assumes 16×16size matrices spanning all 32 threads of the warp.

In an embodiment, each SM 800 comprises M SFUs 812 that perform specialfunctions (e.g., attribute evaluation, reciprocal square root, and thelike). In an embodiment, the SFUs 812 include a tree traversal unitconfigured to traverse a hierarchical tree data structure. In anembodiment, the SFUs 812 include texture unit configured to performtexture map filtering operations. In an embodiment, the texture unitsare configured to load texture maps (e.g., a 2D array of texels) fromthe memory and sample the texture maps to produce sampled texture valuesfor use in shader programs executed by the SM 800. In an embodiment, thetexture maps are stored in the shared memory/L1 cache. The texture unitsimplement texture operations such as filtering operations using mip-maps(e.g., texture maps of varying levels of detail), in accordance with oneembodiment. In an embodiment, each SM 800 includes two texture units.

Each SM 800 comprises N LSUs 754 that implement load and storeoperations between the shared memory/L1 cache 706 and the register file808, in an embodiment. Each SM 800 includes an interconnect network 816that connects each of the functional units to the register file 808 andthe LSU 814 to the register file 808, shared memory/L1 cache 818 in anembodiment. In an embodiment, the interconnect network 816 is a crossbarthat can be configured to connect any of the functional units to any ofthe registers in the register file 808 and connect the LSUs 814 to theregister file and memory locations in shared memory/L1 cache 818.

The shared memory/L1 cache 818 is an array of on-chip memory that allowsfor data storage and communication between the SM 800 and the primitiveengine and between threads in the SM 800 in an embodiment. In anembodiment, the shared memory/L1 cache 818 comprises 128 KB of storagecapacity and is in the path from the SM 800 to the partition unit. Theshared memory/L1 cache 818, in an embodiment, is used to cache reads andwrites. One or more of the shared memory/L1 cache 818, L2 cache, andmemory are backing stores.

Combining data cache and shared memory functionality into a singlememory block provides improved performance for both types of memoryaccesses, in an embodiment. The capacity, in an embodiment, is used oris usable as a cache by programs that do not use shared memory, such asif shared memory is configured to use half of the capacity, texture andload/store operations can use the remaining capacity. Integration withinthe shared memory/L1 cache 818 enables the shared memory/L1 cache 818 tofunction as a high-throughput conduit for streaming data whilesimultaneously providing high-bandwidth and low-latency access tofrequently reused data, in accordance with an embodiment. Whenconfigured for general purpose parallel computation, a simplerconfiguration can be used compared with graphics processing. In anembodiment, fixed function graphics processing units are bypassed,creating a much simpler programming model. In the general purposeparallel computation configuration, the work distribution unit assignsand distributes blocks of threads directly to the DPCs, in anembodiment. The threads in a block execute the same program, using aunique thread ID in the calculation to ensure each thread generatesunique results, using the SM 800 to execute the program and performcalculations, shared memory/L1 cache 818 to communicate between threads,and the LSU 814 to read and write global memory through the sharedmemory/L1 cache 818 and the memory partition unit, in accordance withone embodiment. In an embodiment, when configured for general purposeparallel computation, the SM 800 writes commands that the scheduler unitthat can be used to launch new work on the DPCs.

In an embodiment, the PPU is included in or coupled to a desktopcomputer, a laptop computer, a tablet computer, servers, supercomputers,a smart-phone (e.g., a wireless, hand-held device), personal digitalassistant (“PDA”), a digital camera, a vehicle, a head mounted display,a hand-held electronic device, and more. In an embodiment, the PPU isembodied on a single semiconductor substrate. In an embodiment, the PPUis included in a system-on-a-chip (“SoC”) along with one or more otherdevices such as additional PPUs, the memory, a reduced instruction setcomputer (“RISC”) CPU, a memory management unit (“MMU”), adigital-to-analog converter (“DAC”), and the like.

In an embodiment, the PPU may be included on a graphics card thatincludes one or more memory devices. The graphics card may be configuredto interface with a PCIe slot on a motherboard of a desktop computer. Inyet another embodiment, the PPU may be an integrate graphics processingunit (“iGPU”) included in the chipset of the motherboard.

FIG. 9 illustrates a computer system 900 in which the variousarchitecture and/or functionality can be implemented, in accordance withone embodiment. The computer system 900, in an embodiment, is configuredto implement various processes and methods described throughout thisdisclosure.

In an embodiment, the computer system 900 comprises at least one centralprocessing unit 902 that is connected to a communication bus 910implemented using any suitable protocol, such as PCI (PeripheralComponent Interconnect), PCI-Express, AGP (Accelerated Graphics Port),HyperTransport, or any other bus or point-to-point communicationprotocol(s). In an embodiment, the computer system 900 includes a mainmemory 904 and control logic (e.g., implemented as hardware, software,or a combination thereof) and data are stored in the main memory 904which may take the form of random access memory (“RAM”). In anembodiment, a network interface subsystem 922 provides an interface toother computing devices and networks for receiving data from andtransmitting data to other systems from the computer system 900.

The computer system 900, in an embodiment, includes input devices 908,the parallel processing system 912, and display devices 906 which can beimplemented using a conventional CRT (cathode ray tube), LCD (liquidcrystal display), LED (light emitting diode), plasma display, or othersuitable display technologies. In an embodiment, user input is receivedfrom input devices 908 such as keyboard, mouse, touchpad, microphone,and more. In an embodiment, each of the foregoing modules can besituated on a single semiconductor platform to form a processing system.

In the present description, a single semiconductor platform may refer toa sole unitary semiconductor-based integrated circuit or chip. It shouldbe noted that the term single semiconductor platform may also refer tomulti-chip modules with increased connectivity which simulate on-chipoperation, and make substantial improvements over utilizing aconventional central processing unit (“CPU”) and bus implementation. Ofcourse, the various modules may also be situated separately or invarious combinations of semiconductor platforms per the desires of theuser.

In an embodiment, computer programs in the form of machine-readableexecutable code or computer control logic algorithms are stored in themain memory 904 and/or secondary storage. Computer programs, if executedby one or more processors, enable the system 900 to perform variousfunctions in accordance with one embodiment. The memory 904, thestorage, and/or any other storage are possible examples ofcomputer-readable media. Secondary storage may refer to any suitablestorage device or system such as a hard disk drive and/or a removablestorage drive, representing a floppy disk drive, a magnetic tape drive,a compact disk drive, digital versatile disk (“DVD”) drive, recordingdevice, universal serial bus (“USB”) flash memory.

In an embodiment, the architecture and/or functionality of the variousprevious figures are implemented in the context of the central processor902; parallel processing system 912; an integrated circuit capable of atleast a portion of the capabilities of both the central processor 902the parallel processing system 912; a chipset (e.g., a group ofintegrated circuits designed to work and sold as a unit for performingrelated functions, etc.); and any suitable combination of integratedcircuit.

In an embodiment, the architecture and/or functionality of the variousprevious FIGS. is be implemented in the context of a general computersystem, a circuit board system, a game console system dedicated forentertainment purposes, an application-specific system, and more. In anembodiment, the computer system 900 may take the form of a desktopcomputer, a laptop computer, a tablet computer, servers, supercomputers,a smart-phone (e.g., a wireless, hand-held device), personal digitalassistant (“PDA”), a digital camera, a vehicle, a head mounted display,a hand-held electronic device, a mobile phone device, a television,workstation, game consoles, embedded system, and/or any other type oflogic.

In an embodiment, a parallel processing system 912 includes a pluralityof PPUs 914 and associated memories 916. In an embodiment, the PPUs areconnected to a host processor or other peripheral devices via aninterconnect 918 and a switch 920 or multiplexer. In an embodiment, theparallel processing system 912 distributes computational tasks acrossthe PPUs 914 which can be parallelizable—for example, as part of thedistribution of computational tasks across multiple GPU thread blocks.In an embodiment, memory is shared and accessible (e.g., for read and/orwrite access) across some or all of the PPUs 914, although such sharedmemory may incur performance penalties relative to the use of localmemory and registers resident to a PPU. In an embodiment, the operationof the PPUs 914 is synchronized through the use of a command such as_syncthreads( ) which requires all threads in a block (e.g., executedacross multiple PPUs 914) to reach a certain point of execution of codebefore proceeding.

The specification and drawings are, accordingly, to be regarded in anillustrative rather than a restrictive sense. It will, however, beevident that various modifications and changes may be made thereuntowithout departing from the broader spirit and scope of the invention asset forth in the claims.

Other variations are within the spirit of the present disclosure. Thus,while the disclosed techniques are susceptible to various modificationsand alternative constructions, certain illustrated embodiments thereofare shown in the drawings and have been described above in detail. Itshould be understood, however, that there is no intention to limit theinvention to the specific form or forms disclosed, but on the contrary,the intention is to cover all modifications, alternative constructions,and equivalents falling within the spirit and scope of the invention, asdefined in the appended claims.

The use of the terms “a” and “an” and “the” and similar referents in thecontext of describing the disclosed embodiments (especially in thecontext of the following claims) are to be construed to cover both thesingular and the plural, unless otherwise indicated herein or clearlycontradicted by context. The terms “comprising,” “having,” “including,”and “containing” are to be construed as open-ended terms (i.e., meaning“including, but not limited to,”) unless otherwise noted. The term“connected,” when unmodified and referring to physical connections, isto be construed as partly or wholly contained within, attached to, orjoined together, even if there is something intervening. Recitation ofranges of values herein are merely intended to serve as a shorthandmethod of referring individually to each separate value falling withinthe range, unless otherwise indicated herein and each separate value isincorporated into the specification as if it were individually recitedherein. The use of the term “set” (e.g., “a set of items”) or “subset”unless otherwise noted or contradicted by context, is to be construed asa nonempty collection comprising one or more members. Further, unlessotherwise noted or contradicted by context, the term “subset” of acorresponding set does not necessarily denote a proper subset of thecorresponding set, but the subset and the corresponding set may beequal.

Conjunctive language, such as phrases of the form “at least one of A, B,and C,” or “at least one of A, B and C,” unless specifically statedotherwise or otherwise clearly contradicted by context, is otherwiseunderstood with the context as used in general to present that an item,term, etc., may be either A or B or C, or any nonempty subset of the setof A and B and C. For instance, in the illustrative example of a sethaving three members, the conjunctive phrases “at least one of A, B, andC” and “at least one of A, B and C” refer to any of the following sets:{A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Thus, such conjunctivelanguage is not generally intended to imply that certain embodimentsrequire at least one of A, at least one of B and at least one of C eachto be present. In addition, unless otherwise noted or contradicted bycontext, the term “plurality” indicates a state of being plural (e.g.,“a plurality of items” indicates multiple items). The number of items ina plurality is at least two, but can be more when so indicated eitherexplicitly or by context. Further, unless stated otherwise or otherwiseclear from context, the phrase “based on” means “based at least in parton” and not “based solely on.”

Operations of processes described herein can be performed in anysuitable order unless otherwise indicated herein or otherwise clearlycontradicted by context. In an embodiment, a process such as thoseprocesses described herein (or variations and/or combinations thereof)is performed under the control of one or more computer systemsconfigured with executable instructions and is implemented as code(e.g., executable instructions, one or more computer programs or one ormore applications) executing collectively on one or more processors, byhardware or combinations thereof. In an embodiment, the code is storedon a computer-readable storage medium, for example, in the form of acomputer program comprising a plurality of instructions executable byone or more processors. In an embodiment, a computer-readable storagemedium is a non-transitory computer-readable storage medium thatexcludes transitory signals (e.g., a propagating transient electric orelectromagnetic transmission) but includes non-transitory data storagecircuitry (e.g., buffers, cache, and queues) within transceivers oftransitory signals. In an embodiment, code (e.g., executable code orsource code) is stored on a set of one or more non-transitorycomputer-readable storage media having stored thereon executableinstructions (or other memory to store executable instructions) that,when executed (i.e., as a result of being executed) by one or moreprocessors of a computer system, cause the computer system to performoperations described herein. The set of non-transitory computer-readablestorage media, in an embodiment, comprises multiple non-transitorycomputer-readable storage media and one or more of individualnon-transitory storage media of the multiple non-transitorycomputer-readable storage media lack all of the code while the multiplenon-transitory computer-readable storage media collectively store all ofthe code. In an embodiment, the executable instructions are executedsuch that different instructions are executed by different processorsfor example, a non-transitory computer-readable storage medium storeinstructions and a main CPU execute some of the instructions while agraphics processor unit executes other instructions. In an embodiment,different components of a computer system have separate processors anddifferent processors execute different subsets of the instructions.

Accordingly, in an embodiment, computer systems are configured toimplement one or more services that singly or collectively performoperations of processes described herein and such computer systems areconfigured with applicable hardware and/or software that enable theperformance of the operations. Further, a computer system that implementan embodiment of the present disclosure is a single device and, inanother embodiment, is a distributed computer systems comprisingmultiple devices that operate differently such that the distributedcomputer system performs the operations described herein and such that asingle device does not perform all operations.

The use of any and all examples, or exemplary language (e.g., “such as”)provided herein, is intended merely to better illuminate embodiments ofthe invention and does not pose a limitation on the scope of theinvention unless otherwise claimed. No language in the specificationshould be construed as indicating any non-claimed element as essentialto the practice of the invention.

Embodiments of this disclosure are described herein, including the bestmode known to the inventors for carrying out the invention. Variationsof those embodiments may become apparent to those of ordinary skill inthe art upon reading the foregoing description. The inventors expectskilled artisans to employ such variations as appropriate and theinventors intend for embodiments of the present disclosure to bepracticed otherwise than as specifically described herein. Accordingly,the scope of the present disclosure includes all modifications andequivalents of the subject matter recited in the claims appended heretoas permitted by applicable law. Moreover, any combination of theabove-described elements in all possible variations thereof isencompassed by the scope of the present disclosure unless otherwiseindicated herein or otherwise clearly contradicted by context.

All references, including publications, patent applications, andpatents, cited herein are hereby incorporated by reference to the sameextent as if each reference were individually and specifically indicatedto be incorporated by reference and were set forth in its entiretyherein.

In the description and claims, the terms “coupled” and “connected,”along with their derivatives, may be used. It should be understood thatthese terms may be not intended as synonyms for each other. Rather, inparticular examples, “connected” or “coupled” may be used to indicatethat two or more elements are in direct or indirect physical orelectrical contact with each other. “Coupled” may also mean that two ormore elements are not in direct contact with each other, but yet stillco-operate or interact with each other.

Unless specifically stated otherwise, it may be appreciated thatthroughout the specification terms such as “processing,” “computing,”“calculating,” “determining,” or the like, refer to the action and/orprocesses of a computer or computing system, or similar electroniccomputing device, that manipulate and/or transform data represented asphysical, such as electronic, quantities within the computing system'sregisters and/or memories into other data similarly represented asphysical quantities within the computing system's memories, registers orother such information storage, transmission or display devices.

In a similar manner, the term “processor” may refer to any device orportion of a device that processes electronic data from registers and/ormemory and transform that electronic data into other electronic datathat may be stored in registers and/or memory. As non-limiting examples,“processor” may be a Central Processing Unit (CPU) or a GraphicsProcessing Unit (GPU). A “computing platform” may comprise one or moreprocessors. As used herein, “software” processes may include, forexample, software and/or hardware entities that perform work over time,such as tasks, threads, and intelligent agents. Also, each process mayrefer to multiple processes, for carrying out instructions in sequenceor in parallel, continuously or intermittently. The terms “system” and“method” are used herein interchangeably insofar as the system mayembody one or more methods and the methods may be considered a system.

In the present document, references may be made to obtaining, acquiring,receiving, or inputting analog or digital data into a subsystem,computer system, or computer-implemented machine. The process ofobtaining, acquiring, receiving, or inputting analog and digital datacan be accomplished in a variety of ways such as by receiving the dataas a parameter of a function call or a call to an applicationprogramming interface. In some implementations, the process ofobtaining, acquiring, receiving, or inputting analog or digital data canbe accomplished by transferring the data via a serial or parallelinterface. In another implementation, the process of obtaining,acquiring, receiving, or inputting analog or digital data can beaccomplished by transferring the data via a computer network from theproviding entity to the acquiring entity. References may also be made toproviding, outputting, transmitting, sending, or presenting analog ordigital data. In various examples, the process of providing, outputting,transmitting, sending, or presenting analog or digital data can beaccomplished by transferring the data as an input or output parameter ofa function call, a parameter of an application programming interface orinterprocess communication mechanism.

Although the discussion above sets forth example implementations of thedescribed techniques, other architectures may be used to implement thedescribed functionality, and are intended to be within the scope of thisdisclosure. Furthermore, although specific distributions ofresponsibilities are defined above for purposes of discussion, thevarious functions and responsibilities might be distributed and dividedin different ways, depending on circumstances.

Furthermore, although the subject matter has been described in languagespecific to structural features and/or methodological acts, it is to beunderstood that the subject matter defined in the appended claims is notnecessarily limited to the specific features or acts described. Rather,the specific features and acts are disclosed as exemplary forms ofimplementing the claims.

What is claimed is:
 1. A system-on-a-chip (SoC), comprising: a centralprocessing unit (CPU) to perform a compiler to generate code toaccelerate matrix operations; memory; a Peripheral ComponentInterconnect (PCI) communication bus; and a graphics processing unit(GPU) including a general processing cluster (GPC), where the GPCincludes streaming multiprocessors (SMs) comprising: an instructioncache; a dispatch unit; cores; a load/store unit (LSU); shared memory;an L1 cache; and wherein the compiler is to: obtain a computer program;extract a polyhedral representation of the computer program; determine atransformation schedule using the polyhedral representation; andgenerate executable code based on the transformation schedule and aprocessor architecture.
 2. The SoC of claim 1, wherein the GPU furthercomprises a scheduler unit.
 3. The SoC of claim 1, wherein the compileris to further obtain one or more configuration files comprisingparameters to help determine the transformation schedule.
 4. The SoC ofclaim 1, wherein the polyhedral representation of the computer programis a directed acyclic graph (DAG).
 5. The SoC of claim 1, wherein theGPU further comprises a memory partition unit.
 6. The SoC of claim 1,wherein the GPU further comprises a crossbar (Xbar).
 7. The SoC of claim1, wherein the GPU further comprises an input/output (I/O) unit tointerface with the PCI communication bus.
 8. The SoC of claim 1, furthercomprising a hub to interface with one or more GPU interconnects.
 9. TheSoC of claim 1, wherein the GPC further comprises a raster engine. 10.The SoC of claim 1, wherein the SMs each further comprise one or moreinterconnects.
 11. A system, comprising: a central processing unit (CPU)to perform a compiler to generate code to accelerate matrix operations;memory; a Peripheral Component Interconnect (PCI) communication bus; anda graphics processing unit (GPU) including a general processing cluster(GPC), where the GPC includes streaming multiprocessors (SMs)comprising: an instruction cache; a dispatch unit; cores; a load/storeunit (LSU); shared memory; an L1 cache; and wherein the compiler is to:obtain a computer program; extract a polyhedral representation of thecomputer program; determine a transformation schedule using thepolyhedral representation; and generate executable code based on thetransformation schedule and a processor architecture.
 12. The system ofclaim 11, wherein the GPU further comprises a scheduler unit.
 13. Thesystem of claim 11, wherein the SMs further comprise one or more specialfunction units (SFUs).
 14. The system of claim 11, wherein the SMsfurther comprise a register file.
 15. The system of claim 11, furthercomprising one or more display devices.
 16. The system of claim 11,further comprising a network interface.
 17. The system of claim 11,further comprising a hub to interface with one or more GPUinterconnects.
 18. The system of claim 11, wherein the GPC furthercomprises a raster engine.
 19. The system of claim 11, wherein the SMseach further comprise one or more interconnects.
 20. The system of claim11, wherein the compiler is to further obtain one or more configurationfiles comprising parameters to help determine the transformationschedule.